Building a Large Automatically Parsed Corpus of Finnish
Abstract
We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a large-scale corpus. An independent formal evaluation demonstrates high accuracy of both morphological and syntactic annotation layers. The parsed corpus is freely available within the FIN-CLARIN infrastructure project.
Related resources
Parsed Corpora for Linguistics
Knowledge-based parsers are now accurate, fast and robust enough to be used to obtain syntactic annotations for very large corpora fully automatically. We argue that such parsed corpora are an interesting new resource for linguists. The argument is illustrated by means of a number of recent results which were established with the help of parsed corpora.
DCG Induction using MDL and Parsed
We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon estimation. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best r...
Dependencies vs. Constituents for Tree-Based Alignment
Given a parallel parsed corpus, statistical tree-to-tree alignment attempts to match nodes in the syntactic trees for a given sentence in two languages. We train a probabilistic tree transduction model on a large automatically parsed Chinese-English corpus, and evaluate results against human-annotated word-level alignments. We find that a constituent-based model performs better than a similar pr...
Robust VPE Detection using Automatically Parsed Text
This paper describes a Verb Phrase Ellipsis (VPE) detection system, built for robustness, accuracy and domain independence. The system is corpus-based, and uses machine learning techniques on free text that has been automatically parsed. Tested on a mixed corpus comprising a range of genres, the system achieves a 70% F1-score. This system is designed as the first stage of a complete VPE resolut...
Huge Parsed Corpora in LASSY
One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language p...